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Abstract 

This  paper  presents  the  concept  of  adaptive  programs,  whose  com¬ 
putation  and  communication  structures  can  morph  to  adapt  to  envi¬ 
ronmental  and  demand  changes  to  save  energy  and  computing  re¬ 
sources.  In  this  approach,  programmers  write  one  single  program 
using  a  language  at  a  higher  level  of  abstraction.  The  compiler 
will  exploit  the  properties  of  the  abstractions  to  generate  an  adap¬ 
tive  program  that  is  able  to  adjust  computation  and  communication 
structures  to  environmental  and  demand  changes. 

We  develop  a  technique,  called  StreaMorph,  that  exploits  the 
properties  of  stream  programs'  Synchronous  Dataflow  (SDF)  pro¬ 
gramming  model  to  enable  runtime  stream  graph  transformation. 
The  StreaMorph  technique  can  be  used  to  optimize  memory  us¬ 
age  and  to  adjust  core  utilization  leading  to  energy  reduction  by 
turning  off  idle  cores  or  reducing  operating  frequencies.  The  main 
challenge  for  such  a  runtime  transformation  is  to  maintain  con¬ 
sistent  program  states  by  copying  states  between  different  stream 
graph  structures,  because  a  stream  program  optimized  for  different 
numbers  of  cores  often  has  different  sets  of  filters  and  inter-filter 
channels.  We  propose  an  analysis  that  helps  simplify  program  state 
copying  processes  by  minimizing  copying  of  states  based  on  the 
properties  of  the  SDF  model. 

Finally,  we  implement  the  StreaMorph  method  in  the  Streamlt 
compiler.  Our  experiments  on  the  Intel  Xeon  E5450  show  that  using 
StreaMorph  to  minimize  the  number  of  cores  used  from  eight  cores 
to  one  core,  e.g.  when  streaming  rates  become  lower,  can  reduce 
energy  consumption  by  76.33%  on  average.  Using  StreaMorph  to 
spread  workload  front  four  cores  to  six  or  seven  cores,  e.g.  when 
more  cores  become  available,  to  reduce  operating  frequencies,  can 
lead  to  10%  energy  reduction.  In  addition,  StreaMorph  can  lead  to 
a  buffer  size  reduction  of  82.58%  in  comparison  with  a  straight¬ 
forward  inter-core  filter  migration  technique  when  switching  from 
using  eight  cores  to  one  core. 

1.  Introduction 

Real-time  streaming  of  media  data  is  growing  in  popularity.  This 
includes  both  capture  and  processing  of  real-time  video  and  audio, 
and  delivery  of  video  and  audio  from  servers;  recent  usage  number 
shows  over  800  million  unique  users  that  visit  YouTube  to  watch 
over  3  billion  hours  of  video  each  month  [1],  As  a  result,  YouTube 
data  centers  consume  huge  amounts  of  energy.  Given  that  traffic 
from  mobile  devices  to  YouTube  tripled  in  2011,  energy-efficient 


stream  computing  is  increasingly  important,  especially  since  bat¬ 
tery  life  is  a  big  concern  for  mobile  devices.  In  addition,  as  comput¬ 
ing  devices  are  often  now  equipped  with  high-resolution  displays 
and  high-quality  speakers,  streaming  quality  is  expected  to  rise  ac¬ 
cordingly.  Processing  high-quality  streams  requires  proportionally 
higher  computing  energy.  While  several  efforts  [15,  22]  aim  at  us¬ 
ing  parallel  computing  to  meet  the  demand  of  stream  computation, 
not  much  attention  has  been  paid  to  energy-efficient  parallel  stream 
computing.  These  efforts  mainly  focus  on  devising  scalable  compi¬ 
lation  techniques  to  speed  up  stream  computation.  While  multicore 
can  address  the  rising  computing  demand  for  stream  computation, 
it  also  can  result  in  an  unnecessary  waste  of  energy  due  to  low  core 
utilization  when  more  cores  are  used  than  needed  at  low  streaming 
rates. 

In  this  paper,  we  present  StreaMorph,  a  programming  method¬ 
ology  that  exploits  the  high-level  abstractions  of  stream  programs 
to  generate  runtime  adaptive  ones  for  energy-efficient  executions. 
Based  on  one  single  specification  of  a  stream  program,  the  com¬ 
piler  exploits  domain-specific  knowledge  to  generate  an  adaptive 
program  that  is  able  to  adjust  its  occupied  computing  resources, 
processors,  memory  and  so  on,  at  runtime.  Such  adaptive  pro¬ 
grams  help  reduce  energy  consumption  by  adapting  their  com¬ 
puting  power  to  demand  changes  at  runtime.  This  programming 
methodology  comes  from  our  new  perspective  on  stream  programs. 
The  previous  work  on  stream  compilation  assumes  stream  pro¬ 
grams  are  isolated  entities.  This  assumption  misses  an  important 
optimization  opportunity.  As  isolated  entities,  stream  programs  are 
set  to  run  as  fast  as  possible  by  using  as  much  resources  as  pos¬ 
sible.  Running  stream  programs  aggressively  may  result  in  a  con¬ 
siderable  waste  of  energy.  In  contrast,  we  treat  stream  programs 
as  integrated  components  within  certain  systems.  For  example,  an 
MPEG2  encoder/decoder  does  not  need  to  run  faster  than  its  cam¬ 
era/display  frame  rates.  Digital  signal  processing  (DSP)  programs, 
such  as  Radar,  FM  radio,  and  Vocoders,  do  not  need  to  run  faster 
than  their  input  signals’  sampling  rates.  Because  using  more  cores 
than  needed  for  certain  input  rates  to  run  these  programs  will  waste 
energy  due  to  static  energy  leakage  and  inter-core  data  communi¬ 
cation  overhead,  it  is  beneficial  to  find  the  minimal  number  of  cores 
required  for  a  specific  workload  rate  to  reduce  energy  consumption. 

Determining  the  optimal  number  of  cores  used  for  a  stream  pro¬ 
gram  at  compile-time  is  often  not  possible  due  to  several  factors. 
These  factors  include  varied  processor  speed  used  to  run  stream 
programs,  users’  control  on  stream  quality  and  speed,  varied  run¬ 
time  signal  rates,  etc.  As  a  result,  compiled  stream  applications  are 
necessary  to  be  able  to  adapt  computing  speed  accordingly  to  exter¬ 
nal  changes  at  runtime  to  save  energy.  Expressing  stream  programs 
as  synchronous  dataflow  (SDF)  [24]  models  presents  an  opportu¬ 
nity  for  such  an  energy  optimization.  In  SDF,  a  program  is  decom¬ 
posed  into  multiple  autonomous  filters  connected  using  FIFO  chan¬ 
nels.  As  a  result,  stream  programs  can  dynamically  relocate  filters 


and  channels  between  cores  at  runtime  to  minimize  the  number  of 
cores  used  while  maintaining  required  throughput  demands.  Using 
fewer  cores  reduces  leakage  energy  as  well  as  inter-core  communi¬ 
cation  energy  overhead. 

A  prior  inter-core  filter  migration  technique,  the  one  imple¬ 
mented  in  Flextream  [17]  by  Hormati  et  al.  in  the  context  of  chang¬ 
ing  processor  availability,  can  adjust  the  number  of  cores  used,  but 
it  is  not  optimal  as  it  does  not  adjust  memory  usage  accordingly.  In 
cloud  computing  as  well  as  in  multicore  embedded  systems,  pro¬ 
cessors  are  not  the  only  resource  that  applications  share.  In  these 
computing  environments,  it  is  also  desirable  to  reduce  the  mem¬ 
ory  usage  of  each  application  to  improve  the  overall  utilization  of  a 
whole  system  [11,  12], 

To  improve  inter-core  filter  migration  technique,  our  StreaMorph 
technique  not  only  migrates  filters  between  cores  but  also  trans¬ 
forms  stream  graphs,  thereby  adjusting  the  set  of  filters,  inter-filter 
channels  and  the  number  of  cores  used.  Runtime  stream  graph 
transformation  entails  mapping  and  transferring  data  on  inter-filter 
channels  of  different  stream  graph  configurations.  This  process  is 
non-trivial  because  the  inter-filer  channel  states  become  compli¬ 
cated  when  stream  programs  execute  through  several  stages  such 
as  initialization  and  software  pipelining  [14]. 

We  tackle  the  problem  by  proposing  an  analysis  that  helps  re¬ 
duce  the  number  of  tokens  copied  between  optimized  configura¬ 
tions  of  a  single  stream  program.  The  main  idea  of  the  proposed 
analysis  is  to  derive  sequences  of  filter  executions  to  drain  tokens 
on  channels  as  much  as  possible.  This  StreaMorph  scheme  helps 
optimize  energy  consumption  in  several  ways.  Either  minimizing 
the  number  of  cores  used  at  a  fixed  frequency  or  lowering  operating 
voltages  and  frequencies  of  processors  by  using  more  cores  to  han¬ 
dle  the  same  workload  rates  can  help  reduce  energy  consumption. 
In  addition,  high  processor  temperatures  due  to  running  at  high  uti¬ 
lization  for  long  periods  can  degrade  processors'  performance  and 
lifespan  [6,  13].  As  a  result,  increasing  the  number  of  cores  used 
to  lower  core  utilization,  thereby  reducing  processor  temperatures, 
can  mitigate  the  problem. 

We  apply  the  StreaMorph  scheme  to  a  set  of  streaming  bench¬ 
marks  [15]  on  the  Intel  Xeon  E5450  processors.  The  results  show 
that,  on  average,  using  StreaMorph  to  minimize  the  number  of 
cores  used  whenever  possible  from  eight  cores  to  four  reduces  en¬ 
ergy  consumption  by  29.90%,  and  from  eight  cores  to  one  reduces 
energy  consumption  by  76.33%  on  average.  Using  StreaMorph  to 
spread  workload  on  more  cores  when  they  become  available  from 
four  cores  to  six  or  seven  cores  and  applying  DVFS  can  reduce  en¬ 
ergy  consumption  by  10%.  Our  StreaMorph  scheme  also  leads  to 
an  82.58%  buffer  size  reduction  when  switching  from  an  eight-core 
configuration  to  one  compared  to  Flextream. 

This  paper  makes  the  following  contributions 

•  We  present  the  concept  of  adaptive  programs  that  are  beneficial 
in  cloud  computing  and  embedded  systems.  Our  work  demon¬ 
strates  a  case  for  employing  high-level  abstractions  to  design 
programs  that  adapt  to  demand  and  resource  changes  at  run¬ 
time. 

•  We  identify  an  energy  optimization  opportunity  by  taking  into 
account  external  settings,  e.g.  input  data  rates  or  the  number  of 
available  cores,  of  stream  programs  with  a  task-level  program 
transformation  technique. 

•  We  present  an  analysis  that  helps  simplify  program  state  copy¬ 
ing  processes  when  switching  between  configurations. 

•  We  implement  the  method  in  the  Streamlt  compiler  and  demon¬ 
strate  experimentally  the  effectiveness  of  the  method  using  a  set 
of  Streamlt  benchmarks. 

•  We  also  derive  an  approximate  energy  model  analysis  for 
stream  programs  on  multicore  and  experimentally  validate  the 


model.  The  model  can  serve  as  a  guidance  for  finding  optimal 
energy-efficient  configurations. 


2.  Adaptive  Stream  Programs 

Stream  programs  are  often  compiled  to  maximize  speed  [15, 22]  as¬ 
suming  fixed  numbers  of  cores.  This  approach  implicitly  assumes 
that  sources/sinks  of  data  for  stream  programs  can  always  pro¬ 
duce/consume  data.  This  assumption  is  not  often  true  in  practice. 
For  example,  an  audio  source  cannot  produce  data  faster  than  its 
sampling  rate.  A  sink,  e.g.  a  monitor,  does  not  consume  faster  than 
its  designated  frame  rate.  To  understand  this  situation  better,  we 
discuss  where  stream  programs  fit  into  external  settings. 

Figure  1  shows  a  general  structure  of  DSP  systems.  In  the  fig¬ 
ure,  the  analog-to-digital  converter  (ADC)  samples  analog  signals 
at  a  sampling  frequency  /  and  outputs  digital  signals  to  the  DSP 
module,  which  executes  stream  programs.  The  DSP  module  ma¬ 
nipulates  digital  signals  and  subsequently  feeds  processed  digital 
signals  to  the  digital- to-analog  converter  (DAC).  The  DAC  converts 
input  digital  signals  to  analog  signals  at  a  conversion  frequency  g. 

This  general  structure  of  DSP  systems  suggests  DSP  modules 
need  not  process  more  slowly  than  sampling  /  and/or  conversion  g 
frequencies.  It  is  generally  not  possible  to  get  ahead  of  the  source 
of  data,  when  that  source  is  a  real-time  source,  and  it  is  generally 
not  necessary  to  get  ahead  of  the  sink.  Even  with  real-time  sources 
and  sinks,  there  may  be  some  benefit  to  getting  ahead,  because 
results  can  be  buffered,  making  interruptions  less  likely  in  the 
future  when  there  is  contention  for  computing  resources.  However, 
getting  ahead  produces  no  evident  benefit  unless  such  contention 
occurs,  and  it  comes  at  an  energy  cost.  When  both  the  source 
and  the  sink  are  operating  in  real  time,  it  also  comes  at  a  cost  in 
latency.  Having  more  cores  than  necessary  can  waste  energy  due  to 
static  energy  leakage  when  idle.  In  addition,  inter-core  data  transfer 
overhead  when  using  more  cores  than  needed  can  be  another  source 
of  energy  inefficiency.  Even  when  buffering  is  used,  in  steady  state, 
it  is  beneficial  for  processing  speed  to  match  10  rates,  otherwise 
buffers  will  overflow  or  underflow. 

We  identify  two  scenarios  that  can  lead  to  the  mismatch  between 
10  rates  and  processing  speed. 

•  Varied  IO  rates  at  runtime:  Users  can  use  fast  forward  func¬ 
tionality  to  quickly  browse  through  a  video  or  audio  file. 
Graphic  applications  need  to  render  at  faster/slower  frame 
rates.  When  users  want  to  increase/decrease  songs’  tempos 
in  Karaoke  systems,  sound  synthesis  engines  have  to  adjust 
accordingly.  Sampling  rates  are  increased/decreased  to  alter 
quality. 

•  Different  processor  speeds:  Even  though  compilers  can  match 
between  processing  speed  and  certain  IO  rates  at  compile  time, 
this  optimization  can  only  be  done  for  specific  processor  mod¬ 
els.  Processors  of  the  same  instruction  set  architecture  can  vary 
in  several  dimensions  such  as  clocking  frequencies,  microarchi¬ 
tecture,  fabrication  processes.  These  variations  lead  to  wildly 
varied  execution  times  of  a  single  program  on  different  proces¬ 
sors  with  the  same  instruction  set. 


2.1  Energy  Optimization 

To  get  the  energy  benefits  of  matching  processing  speed  to  IO 
rates,  two  common  techniques  are:  1)  To  consolidate  tasks  by 
relocating  tasks  to  fewer  cores  to  improve  core  utilization  and 
enable  turning  off  idle  cores  [2,  31];  2)  To  vary  processing  speed  by 
applying  dynamic  voltage  and  frequency  scaling(DVFS)  [3,  9,  16, 
35]  or  processor  composition  [8,  21]  or  heterogeneous  processors 
of  different  speeds  [23,  26.  36], 
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Figure  1.  Basic  digital  signal  processing  structure. 


Figure  2.  Task  consolidation  and  de¬ 
composition. 


2.1.1  Task  Consolidation 

Especially  for  stream  programs,  task  consolidation  can  help  re¬ 
duce  inter-core  communication  overhead.  However,  applying  the 
task  consolidation  technique  to  stream  programs  optimized  for 
speed  is  not  straightforward.  For  example,  in  the  left  of  Figure  2, 
a  stream  program  is  composed  of  three  tasks  Tl,  T2  and  T3  con¬ 
nected  through  two  FIFO  channels.  Suppose  that  each  task  resides 
on  a  separate  core  in  a  quad-core  processor  and  utilizes  60%  of  its 
core  at  current  10  rates.  Relocating  normally  one  of  the  three  tasks 
to  another  occupied  core  will  overload  the  core.  Consequently, 
the  processing  system  cannot  guarantee  the  program’s  required  10 
rates.  Instead,  the  program  should  switch  to  a  different  configura¬ 
tion  composed  of  two  tasks  Tl  ’  and  T2  ’ ;  each  utilizes  90%  of  its 
residing  core  as  in  the  right  of  Figure  2. 

This  configuration  switching  is  complicated  because  the  two 
configurations  may  be  composed  of  different  sets  of  filters  and 
channels  due  to  speed  optimization  techniques  that  depend  on  the 
number  of  cores  used.  For  example,  the  configuration  composed 
of  Tl,  T2  and  T3  is  optimized  for  3  cores,  while  the  configuration 
composed  of  Tl  ’  and  T2  ’  is  optimized  for  2  cores. 

2.1.2  Task  Decomposition 

Task  consolidation  is  necessary  as  discussed  in  the  previous  sec¬ 
tion,  however,  there  are  a  number  of  scenarios  where  it  is  useful  to 
decompose  tasks. 

•  Suppose  that  running  Tl’  and  T2’  on  cores  1  and  2  at  90% 
utilization  drastically  increases  those  cores’  temperatures  [13, 
25].  At  this  point,  we  want  to  run  2  tasks  on  3  cores  to  reduce 
utilization  as  in  Figure  2,  and  thereby  consequently  reducing 
the  cores’  temperatures. 

•  If  the  processors  posses  a  DVFS  capability,  and  there  are  more 
cores  available  because  some  other  applications  terminated, 
decomposing  two  tasks  Tl  ’  and  T2  ’  into  three  tasks  Tl,  T2  and 
T3  and  reducing  the  operating  frequencies  of  the  processors  can 
lead  to  energy  reduction;  power  is  proportional  to  the  cube  of 
frequency. 

2.2  Adjusting  Shared  Resources  Proportionally 

Section  2.1  suggests  that  inter-core  filter  migration  to  adjust  pro¬ 
cessor  utilization  and  speed  can  help  reduce  energy  consumption. 
While  the  inter-core  filter  migration  implemented  in  Flextream  [17] 
can  adjust  the  number  of  cores  used  at  runtime,  it  is  not  optimal, 
because  the  technique  does  not  reduce  memory  usage  accordingly. 
This  limitation  reduces  the  value  for  cloud  computing,  where  appli¬ 
cations  are  allocated  cores,  memory,  networking  bandwidth,  and  so 
on,  proportionally  [11.  12];  for  example,  one  application  using  two 
cores  is  allocated  4GB  of  memory  while  another  application  using 
four  cores  is  allocated  8GB.  As  a  consequence,  just  adjusting  the 
number  of  cores  used  by  relocating  filters  between  cores  without 
adjusting  memory  usage  as  in  Flextream  [17]  is  no  longer  suffi¬ 
cient.  In  particular,  reducing  memory  usage  is  crucial  in  embedded 
systems  where  memory  is  scarce. 

In  contrast  to  Flextream  [17],  our  StreaMorph  technique  adjusts 
both  the  number  of  cores  used  and  memory  usage  simultaneously. 


We  illustrate  the  difference  using  an  example.  To  speed  up  stream 
computation,  stateless  filters  are  often  replicated  as  in  Figure  3(b)  to 
utilize  available  cores,  where  filters  Bi,  B2,  Di  and  D2  are  duplicated 
from  the  respective  ones  in  the  original  stream  graph  in  Figure  3(a). 
Duplicating  filters  is  necessary  when  stream  IO  rates  are  high 
and  we  need  to  use  more  cores  to  handle  such  high  IO  rates. 
When  IO  rates  become  low,  a  Flextream  inter-core  filter  migration 
technique  retains  the  stream  graphs  in  Figure  3(b)  and  relocates 
filters  to  reduce  the  number  of  core  used.  For  example,  originally 
the  application  in  Figure  3(b)  runs  on  two  cores,  but  now  the  input 
rate  is  lowered  and  one  core  can  handle  the  workload.  In  this  case, 
the  filters  are  relocated  in  Figure  3(b)  and  the  stream  graph  is 
not  modified.  In  contrast,  our  StreaMorph  technique  modifies  the 
stream  graph  into  the  one  in  Figure  3(a).  As  a  result,  our  technique 
would  be  more  memory-efficient  because  it  eliminates  the  buffers 
between  filters  Split,  B2,  D2,  and  Join.  In  addition,  we  will  show 
that  our  technique  also  reduces  the  buffer  sizes  of  the  other  buffers. 

3.  StreaMorph:  Designing  Adaptive  Programs 
with  High-Level  Abstractions 

Section  2  suggests  that  adapting  stream  programs  to  external 
changes  can  lead  to  energy  and  resource  reduction.  In  this  section, 
we  will  show  how  to  attain  such  adaptivity  by  exploiting  stream 
program  abstractions. 


(b)  Transformed  SDF 
model 


Figure  3.  Stream  graph  transformation 


3.1  Stream  Abstractions  and  Data  Parallelism 

3.1.1  Model  of  Computation 

The  SDF  [24]  model  of  computation  is  often  used  to  model  ex¬ 
ecutions  of  stream  programs  [33].  This  abstraction  enables  sev¬ 
eral  compiler  optimization  techniques  [15,  22]  for  buffer  space, 
scheduling  and  mapping  to  underlying  architecture.  In  the  SDF 
model  of  computation,  a  stream  program  is  given  as  a  graph  com¬ 
posed  of  a  set  of  actors,  called  filters  in  this  paper,  communicating 
through  FIFO  channels.  Each  filter  has  a  set  of  input  and  output 


ports.  Each  channel  connects  an  output  port  of  a  filter  to  an  input 
port  of  another  actor.  Each  filter  consumes/produces  a  fixed  number 
of  tokens  from  each  of  its  input/output  port  each  time  it  executes. 

Figure  3(a)  shows  an  example  of  an  SDF  stream  graph.  Filter 
A  has  two  output  ports.  Each  time  A  executes,  it  produces  2  tokens 
on  its  left  port  and  1  token  on  its  right  port.  Filter  B  consumes  1 
and  produces  2  tokens  each  time  it  executes,  and  so  on.  The  theory 
of  the  SDF  programming  model  provides  the  algorithm  to  compute 
the  number  of  times  each  actor  has  to  execute  within  one  itera¬ 
tion,  of  the  whole  stream  graph,  so  that  the  total  number  of  tokens 
produced  on  each  channel  between  two  actors  is  equal  to  the  total 
number  of  tokens  consumed.  In  other  words,  the  number  of  tokens 
on  each  channel  between  actors  remains  unchanged  after  one  itera¬ 
tion  of  the  whole  stream  graph.  For  example,  in  one  iteration  of  the 
stream  graph  in  Figure  3(a),  filters  A.  B,  C,  D,  E  have  to  execute  3, 
3,  2,  2,  2  times  respectively.  Repeating  this  basic  schedule  makes 
the  number  of  tokens  on  each  channel  remain  the  same  after  one 
iteration  of  the  whole  stream  graph.  For  instance,  in  the  channel 
between  B  and  D,  in  one  iteration,  B  produces  3x2  tokens  while  D 
consumes  2x3  tokens. 

3.1.2  Sliding  Windows 

The  Streamlt  language  [33]  is  a  language  for  writing  stream  ap¬ 
plications  extending  the  SDF  model  of  computation  with  a  sliding 
window  feature,  in  which  filters  can  peek  (read)  tokens  ahead  with¬ 
out  consuming  those  tokens.  As  a  result,  many  states  of  programs 
are  stored  on  channels  in  the  form  of  data  tokens  instead  of  being 
stored  internally  within  filters.  Consequently,  many  filters  become 
stateless  and  eligible  for  the  filter  replication  technique  to  speed 
up  computation  as  in  Section  3.1.3.  As  many  states  of  programs 
are  stored  on  channels,  it  is  easier  for  compilers  to  migrate  states 
between  different  configurations  during  the  morphing  process. 

3.1.3  Data  Parallelism  Exposed  by  Stream  Abstractions 

In  this  section,  we  will  discuss  how  the  stream  abstractions  in  the 
previous  section  can  help  adjusting  parallelism  in  stream  programs, 
thereby  adjusting  computing  speed.  To  speed  up  a  stream  applica¬ 
tion.  we  can  replicate  stateless  filters  so  that  multiple  instances  of 
one  stateless  filter  execute  in  parallel.  This  filter  replication  tech¬ 
nique  is  feasible  because  each  filter  in  an  SDF  application  con¬ 
sumes/produces  a  known  numbers  of  tokens  whenever  it  executes 
at  each  of  its  input/output  port.  As  a  result,  after  replicating  filters, 
the  compiler  knows  how  to  distribute/merge  data  tokens  to/from 
each  replicated  filter. 

Let  us  take  the  example  in  Figure  3(a)  to  illustrate  the  problem. 
Suppose  that  B  and  D  are  stateless,  as  a  result,  we  can  duplicate 
the  filters  to  obtain  the  configuration  in  Figure  3(b).  Because  B 
consumes  one  token  and  D  produces  one  token  each  time  they 
execute,  we  can  distribute  data  tokens  to  each  duplicated  B  and 
collect  data  tokens  from  each  D  evenly  in  a  round-robin  fashion 
using  Split  and  Join  filters  in  Figure  3(b).  With  this  duplication, 
we  can  get  2x  speed-up  for  the  computation  of  the  two  filters. 
Stateless  filter  replication  is  a  popular  technique  that  has  proved  to 
achieve  significant  speed-up  for  several  Streamlt  benchmarks  [15, 
22],  Stateless  filters  can  be  replicated  many  times  to  fill-up  all 
available  cores,  consequently,  this  technique  is  dependent  on  the 
number  of  cores  used. 

Execution  Scaling:  The  filter  replication  technique  may  require 
changing  the  number  of  times  each  filter  executes  within  one  iter¬ 
ation.  For  example,  A.  C,  and  E  in  Figure  3(b)  now  execute  6,  4, 
and  4  times  respectively  in  one  iteration.  This  means  those  filters 
now  execute  twice  as  often  within  one  iteration.  This  effect  is  called 
execution  scaling. 


3.2  Implementing  Adaptive  Stream  Programs 

In  the  previous  section,  we  discussed  how  the  stream  abstractions 
can  help  improving  stream  program  speed  by  exploiting  data  par¬ 
allelism.  However,  switching  between  configurations  optimized  for 
different  numbers  of  processors  is  complicated  by  several  compi¬ 
lation  techniques  applied  to  stream  programs  [14],  We  will  show 
how  to  mitigate  the  complication  in  this  section. 

3.2.1  Execution  Stages 

Initialization  Stage:  As  downstream  filters  can  peek  tokens  with¬ 
out  consuming  them,  upstream  filters  have  to  execute  a  number  of 
times  initially  in  the  initialization  stage  to  supply  more  tokens  to 
downstream  peeking  filters.  For  example,  suppose  that  filter  B  in 
Figure  3(a),  each  time  it  executes,  peeks  ahead  3  tokens  and  con¬ 
sumes  only  the  first  one  of  the  3.  Consequently,  it  requires  at  least  2 
additional  tokens  on  the  channel  between  filters  A  and  B.  To  satisfy 
this  requirement,  A  has  to  execute  twice  in  the  initialization  stage 
to  load  the  channel  with  two  tokens. 

Software  Pipelining:  Gordon  et  al.  use  software  pipelining  to  ad¬ 
dress  a  drawback  of  hardware  pipelining  [15].  In  hardware  pipelin¬ 
ing,  only  contiguous  filters  should  be  mapped  onto  one  core,  e.g. 
in  Figure  3(a),  filters  A  and  D  should  not  be  mapped  into  a  core 
when  filter  B  is  mapped  to  another  core.  This  mapping  restriction 
can  potentially  lead  to  unbalanced  allocation.  Software  pipelining 
allows  mapping  any  filters  to  any  core.  e.g.  filters  A  and  D  can  be 
mapped  into  a  core  even  when  filter  B  is  mapped  to  another  core. 
This  is  done  by  constructing  a  loop  prologue  in  which  each  filter  ex¬ 
ecutes  a  certain  number  of  times  to  make  data  tokens  available  for 
downstream  filters  to  run  in  one  iteration.  In  other  words,  within 
one  iteration,  filters  do  not  communicate  directly  to  each  other,  in¬ 
stead,  they  output  tokens  into  buffers  that  will  be  read  by  down¬ 
stream  filters  in  the  next  iteration.  Figure  4(a)  shows  how  the  pro¬ 
logue  stage  is  executed  and  its  transition  to  the  steady-state  stage. 
In  the  steady-state  stage,  filters  execute  one  stream  graph  iteration 
forming  a  steady-state  repetition.  Each  filter  is  able  to  execute  com¬ 
pletely  independently  within  one  steady  state  repetition  without  the 
need  for  waiting  for  upstream  filters  to  produce  data  in  that  repe¬ 
tition.  For  example,  for  the  model  in  Figure  3(a),  in  order  for  E  to 
execute  2  times  independently,  C  and  D  have  to  execute  twice.  Con¬ 
sequently,  B  and  A  have  to  execute  3  times.  Similarly,  for  C  and  D 
to  execute  twice  independently,  B  and  A  have  to  execute  3  times. 
Finally,  for  B  to  execute  3  times  independently,  A  has  to  execute  3 
times.  Adding  up,  in  the  prologue  stage,  A  executes  (3+3+3)  =  9 
times,  B  executes  (3+3)=6  times,  C  and  D  execute  2  times.  After  the 
loop  prologue  A,  B,  C,  D,  E  can  execute  independently  within  one 
steady-state  repetition. 

3.2.2  Deriving  Reverse  Sequences 

Normally,  a  compiled  program  repeatedly  executes  its  steady-state 
repetitions  regardless  of  external  environment  changes  as  in  Fig¬ 
ure  4(a).  This  conventional  execution  model  may  no  longer  be  ef¬ 
ficient  in  cloud  computing,  mobile  computing  and  cyber-physical 
systems  where  applications  run  in  various  environments  or  have  to 
interact  with  physical  environments.  For  example,  let  us  consider 
the  situation  when  there  are  more  cores  becoming  available  in  the 
system  because  some  other  applications  terminated.  One  possible 
way  to  save  energy  is  to  utilize  the  newly  available  cores  to  process 
a  part  of  the  workload  of  the  running  application  at  lower  frequen¬ 
cies  to  save  energy.  Suppose  utilizing  the  available  cores  would  re¬ 
quire  the  application  to  switch  from  configuration  C\  in  Figure  3(a) 
to  the  configuration  C2  in  Figure  3(b).  This  transition  takes  place 
at  the  end  of  a  repetition  in  the  Ci’s  steady-state  stage.  Note  that, 
in  software  pipelining,  the  prologue  stage  fills  the  channels  of  C\ 
with  tokens,  for  example,  the  channel  between  B  and  D  contains  6 
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Figure  4.  Execution  strategies  for  stream  programs 


tokens;  the  channel  between  D  and  E  contains  2  tokens.  As  B  and  D 
are  duplicated,  determining  how  to  copy  and  distribute  those  tokens 
on  the  corresponding  channels  in  Figure  3(b)  is  complicated.  It  is 
even  more  problematic  to  derive  from  such  a  state  of  Ci  a  sequence 
of  filter  executions  such  that,  after  executing  such  a  sequence,  filters 
in  Figure  3(b)  can  execute  in  a  software  pipelining  fashion.  Deriv¬ 
ing  such  a  switching  procedure  for  each  pair  of  configurations  may 
be  costly. 

We  simplify  the  switching  process  using  the  strategy  in  Fig¬ 
ure  4(b).  Instead  of  deriving  complicated  inter-configuration  token 
copying  procedures,  we  use  an  epilogue  stage  to  reduce  the  number 
of  copied  tokens.  The  epilogue  stage  is  derived  to  undo  the  effect  of 
the  prologue  stage.  As  a  result,  we  only  need  to  copy  the  fixed  and 
known  number  of  tokens  across  the  configurations  produced  by  the 
initialization  stage.  If  there  is  no  peeking  filter,  only  initial  tokens 
and  states  of  stateful  filters  may  still  require  copying. 

DEFINITION  1.  Sequences  of  executions  that  undo  the  effect  of 
the  program’s  current  configuration’s  prologue  stage  are  called 
reverse  sequences. 

Let  Vc  be  the  vector  of  the  number  of  tokens  on  each  channel 
of  configuration  C  right  before  its  prologue  stage. 


As  it  is  desirable  to  switch  between  configurations  as  soon  as 
possible,  we  find  the  smallest  d ki  >  0  satisfying  equation  (3)  by 
finding  the  j  such  that: 


j  =  argmax 
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Because  aki  is  an  integer  and  aki  =  — 1 - ^ - —  + 

we  can  conclude  that  (dkj  +  (j>kj  mod  SAj))  mod  SAj  = 
0.  Hence, 

•  If  pkj  mod  SAj  =  0,  we  find  the  smallest  dkj  =  0. 

•  If  Paj  mod  SAj  >  0,  we  find  the  smallest  dkj  =  SAj  — 
(Pa1  mod  sai).  It  is  easy  to  prove  that  dk ■  >  0. 

Now  we  derive  other  dki  from  equation  (3)  as  follows: 
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As 


<4+(pa-  mod  lAj) 


is  equal  to  either  0  or  1,  from  equation 


(6),  it  is  easy  to  prove  that  dki  is  an  integer.  Now  we  need  to  prove 
dAi  >  0  Vi  =  1, . . . ,  n.  From  (3)  and  (4): 


DEFINITION  2.  States  of  a  configuration  C.  whose  vectors  of  the 
numbers  of  tokens  on  the  channels  are  equal  to  Vc,  are  called  C’s 
pre-prologue  states. 

Note  that,  for  any  configuration  C,  from  a  pre-prologue  state,  if 
each  filter  executes  the  same  number  of  iterations,  C  is  again  in  a 
pre-prologue  state.  As  a  result,  a  reverse  sequence  can  be  derived 
by  finding  dk,  VA  €  C,  the  additional  number  of  times  each  filter 
has  to  execute  more  so  that  all  the  filters  will  have  executed  the 
same  number  of  iterations.  Suppose  that  number  of  iterations  is  J, 
counting  from  C’s  first  pre-prologue  state  (right  before  executing 
C’s  first  prologue  stage).  Suppose  that  C  has  n  filters  A;  where 
i  =  1, . . . ,  n,  and  each  filter  A,  executes  pAi  and  ski  times  in  its 
prologue  stage  and  one  iteration  respectively.  We  have: 


Pki  +  X  I  Ski  +  dki  —  J ski  VA i  £  C  (1) 

where  I  is  the  current  repetition  of  the  steady-state  stage  and  A'  is 
the  execution  scaling  factor.  From  equation  (1),  since  X.  I ,  and  J 
are  all  integers,  pki  +  dki  has  to  divide  ski  .  Let  aki  be  an  integer 
such  thatpA;  +  dxt  =  akiSAi-  Equation  (1)  becomes: 

&AiSAi  +  XI  Sa,-  =  dSki 

q:a,  =  J  —  XI  (2) 


Note  that  equation  (2)  holds  VA;  £  C.  Expanding  the  equation  for 
all  actors  in  the  configuration  C,  we  arrive  at:  aAl  =  a k2  =  ...  = 
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Plugging  into  equation  (5),  we  arrive  at: 
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Being  able  to  find  dAi  d°es  not  necessarily  mean  that  reverse 
sequences  always  exist;  there  may  be  additional  data  dependency 
constraints.  Within  the  SDF  literature,  even  if  we  can  find  numbers 
of  times  filters  execute  within  one  iteration,  it  is  still  possible  that 
the  stream  graph  is  not  schedulable,  e.g.  when  the  stream  graph 
has  loops  forming  circular  dependencies.  The  following  theorem 
implies  that  it  is  always  possible  to  undo  the  effect  of  prologue 
stages. 


THEOREM  1.  A  reverse  sequence  always  exists  for  a  configuration 
C  if  the  configuration  has  prologue  and  steady-state  schedules  ( an 
executed  stream  graph ). 


Proof:  We  prove  this  by  contradiction.  Suppose  that  we  cannot 
derive  a  concrete  reverse  sequence  based  on  the  values  dAi  found 
as  above.  The  only  reason  would  be  because  of  data  dependency 
between  the  executions  of  the  filters.  Let  A;  denote  the  eth  execution 
of  filter  A i  from  the  beginning  right  before  the  prologue  stage. 
There  exist  two  possible  cases: 

1 )  There  exists  a  data  dependency  loop  between  filter  executions 
A eX  -g  A^2  A  ...  A  -<  A'i21  of  m  filters.  As  the 


s  A  1 


existence  of  this  dependency  loop  depends  solely  on  the  property 
of  C,  consequently,  the  execution  of  C  will  be  stalled  due  to  the 
dependency  loop.  This  contradicts  with  the  fact  that  C  can  be 
repeated  in  the  steady-state  stage  forever. 

2)  Some  filter  B  is  at  its  fcth  execution  and  has  not  completed 
ds  executions  of  its  epilogue  stage  to  reach  its  execution  (Jss)th; 
in  other  words,  b  <  Jsb-  B  cannot  proceed  further  because  it 
requires  more  data  tokens  from  its  upstream  filter  A,  while  A  has 
completed  cIa  executions  in  its  epilogue  stage  to  reach  its  (  Jsa)’11 
execution.  This  implies  that  A,  having  executed  J  iterations,  still 
does  not  produce  enough  tokens  for  B  to  execute  J  iterations. 
This  contradicts  the  property  of  SDF  that  the  number  of  tokens 
A  produces  in  one  iteration  is  equal  to  the  number  of  tokens  B 
consumes  in  one  iteration.  □ 

We  now  need  to  derive  concrete  reverse  sequences  based  on 
d>Ai-  It  is  desirable  to  use  current  buffers  without  increasing  buffer 
sizes  for  executing  reverse  sequences.  As  each  buffer  is  large 
enough  to  store  tokens  from  the  upstream  filter  for  the  downstream 
filter  to  execute  at  least  one  iteration  even  if  the  producing  filter  has 
not  produced  any  more  tokens,  it  is  safe  to  execute  upstream  filters 
for  at  most  one  iteration  continuously.  After  that,  downstream  fil¬ 
ters  have  to  execute  to  free  up  buffers.  Algorithm  1  derives  reverse 
sequences  when  stream  graphs  do  not  contain  loops.  However, 
most  of  the  stream  benchmarks  do  not  contain  loops  [33]  and  all 
the  benchmarks  used  in  [15,  22]  are  loop-free.  When  stream  graphs 
have  loops,  the  classical  symbolic  execution  method  of  SDF  [24] 
can  be  used.  The  classical  method  may  yield  more  complicated 
sequences  as  it  randomly  selects  filters  to  execute.  This  random  ex¬ 
ecution  ordering  can  cause  severe  performance  degradation  during 
configuration  switching  due  to  losing  cache  locality.  To  mitigate 
the  potential  negative  effect  of  the  classical  symbolic  execution 
method,  the  following  algorithm  seeks  to  execute  each  filter  sev¬ 
eral  times  in  a  row  by  traversing  filters  in  dataflow  order. 

Data:  reverseMap:  Map  from  filters  to  numbers  of  reverse  executions 
schedule:  containing  prologue  and  steady-state  schedules 
Result:  reverseSeq:  reverse  sequence  of  filter  executions 
reverseSeq  <—  new  ListO 
filters  <—  getFiltersInDataf lowOrder () 
while  r ever seMap.  size  ()  >  0  do 
thisStepMap  <—  new  MapO 
for  /  £  filters  do 

nReverseExes  <—  reverseMap. get  (/) 

nlterExes  <—  sc/ie<iuZe.getNumberExesInOneIteration() 
m  =  min  {nReverseExes,  nlterExes ) 
this S tepMap. put( / ,  m) 
nReverseExesLeft  <—  nReverseExes  —  m 
if  nReverseExesLeft  >  0  then 
|  reverseM  ap.put  (/ ,  nReverseExesLeft ) 
else 

|  reverseM  ap.remove(/) 

end 

end 

reverseS  eq.B.dd^t  his  Step  Map) 

end 

return  reverseSeq 

Algorithm  1:  Deriving  reverse  execution  sequences  when  stream 
graphs  are  loop-free. 

3.2.3  Illustrative  Example 

We  illustrate  the  analysis  method  using  the  configuration  in  Fig¬ 
ure  3(a).  For  the  program:  pA=9,pB=6,  pc=Po=2,  pE=0;  sa=sb=3, 
sc=sd=se=2;  X=  1.  As  pt  ™°d  +  [§^-J=3  is  max  and  pa 

mod  sa  =  0,  we  set  dA  =  0-  Finally,  from  (6),  we  find  dB=3, 
dc=dD= 4,  dE= 6.  Readers  can  verify  that  condition  (1)  is  satisfied. 
Now,  applying  Algorithm  1,  we  find  that,  in  the  first  reverse  step, 
B,  C,  D,  E  execute  3,  2,  2,  2  times  respectively.  B  has  executed  all 
its  reverse  executions,  so  it  is  removed  from  the  reverseMap.  In 


the  second  step,  C,  D,  E  execute  2,  2,  2  times  respectively,  and  C, 
D  are  removed  from  the  reverseM ap.  In  the  last  step,  only  E  ex¬ 
ecutes  2  times  and  is  removed  from  the  reverseMap.  Now  the 
reverseMap  is  empty  so  the  algorithm  terminates.  Applying  the 
reverse  sequence  will  drain  all  the  tokens  on  the  channels  from  B  to 
D  and  from  D  to  E.  Figure  4(b)  displays  how  the  reverse  execution 
sequence  is  executed. 

3.2.4  Peeking  Token  Copying 

As  reverse  sequences  can  undo  the  effect  of  prologue  stages,  if 
stream  programs  do  not  contain  peeking  filters,  then  after  epilogue 
stages,  the  effect  of  respective  prologue  states  is  undone.  As  a 
result,  we  only  need  to  copy  filter  states,  which  are  small,  and 
initial  tokens,  which  often  do  not  exist.  When  peeking  filters  exist, 
for  example,  B  peeks  3  tokens  and  only  consumes  1  each  time  it 
executes,  the  channel  between  A  and  B  will  always  contain  tokens 
after  the  initialization  stage.  Copying  those  peeking  tokens  requires 
further  elaboration  about  how  stream  graphs  with  peeking  filters  are 
optimized. 

Efficient  Sliding  Window  Computation:  Gordon,  in  his  PhD  the¬ 
sis  [14],  presents  a  method  to  reduce  inter-core  communication  that 
degrades  performance  in  SMP  and  Tilera  machines  for  applications 
with  peeking  filters.  We  will  use  an  example  to  illustrate  Gordon’s 
method. 

Consider  the  configuration  in  Figure  3(b),  where  filter  B  is 
duplicated  into  Bi  and  B2.  Because  B  peeks  tokens,  it  is  necessary 
to  send  more  tokens  to  the  input  channels  of  replicating  filters  Bi 
and  B2  than  the  number  of  tokens  Bi  and  B2  consume.  A  fine¬ 
grained  data  parallelism  approach  where  Bi  and  B2  alternatively 
consume  tokens  from  A  at  the  step  size  of  1  will  double  the  amount 
of  communication  between  A  and  B  as  in  Figure  5(a).  For  example, 
suppose  that  in  one  iteration  A  produces  2 n  tokens  on  its  output 
channel,  and  there  are  2  more  tokens,  numbered  1  and  2,  produced 
by  A  in  the  initialization  stage.  Bi  consumes  token  1  and  reads 
tokens  2  and  3;  B2  consumes  token  2  and  reads  tokens  3  and  4; 
Bi  consumes  token  3  and  reads  tokens  4  and  5;  and  so  on.  At  the 
end  of  the  iteration  in  the  steady-state  stage,  B2  consumes  token  2 n 
and  reads  tokens  2n  +  1  and  2 n  +  2.  As  a  result,  both  Bi  and  B2 
require  2n  +  1  tokens  out  of  2 n  +  2.  The  total  amount  of  traffic  is 
therefore  4n  +  2  tokens  in  comparison  with  2n  +  2  tokens  for  the 
original  stream  graph.  This  communication  overhead  may  degrade 
performance  significantly  for  applications  with  peeking  filters  in 
SMP  and  Tilera  machines  [14], 

To  reduce  the  communication  overhead,  a  coarse-grained  data 
parallelism  approach  [14]  is  used  instead  as  in  Figure  5(b).  Now  Bi 
consumes  the  first  n  tokens  and  reads  tokens  n  +  1,  n  +  2;  B2  con¬ 
sumes  tokens  from  n  + 1  to  2 n  and  reads  tokens  2 n  + 1  and  2n  +  2. 
As  a  result,  A  only  needs  to  send  n  +  2  tokens  to  both  Bi  and  B2 
within  one  iteration  of  the  steady-state  stage.  The  amount  of  com¬ 
munication  is  therefore  2n  +  4  tokens  in  comparison  with  2n  +  2 
tokens  for  the  original  graph.  If  n  is  large,  the  overhead  becomes 
insignificant.  As  2 n  is  the  number  of  tokens  A  produces  within 
one  iteration,  we  can  scale  up  the  number  of  times  each  filter  exe¬ 
cutes  within  one  iteration  to  increase  n.  Note  that  in  Figure  5(a)(b) 
and  (c),  colored  tokens  are  peeked  only  and  never  consumed.  This 
coarse-grained  data  parallelism  approach  does  not  increase  out- 
put(joining)  buffer  sizes  over  the  fine-grained  one  because  there  are 
no  duplicated  produced  data.  As  a  result,  this  coarse-grained  data 
parallelism  approach  also  does  not  change  the  real-time  behavior 
of  a  program  as  it  does  not  requires  additional  buffer  space. 

Copying  Peeking  Tokens:  The  Streamlt  compiler  enforces  an  ad¬ 
ditional  constraint  [14]  that  the  number  of  tokens  on  the  input  chan¬ 
nel  of  a  replicated  filter  right  after  the  initialization  stage  has  to  be 
smaller  than  the  number  of  tokens  the  filters  will  consume  within 


Figure  5.  Peeking  token  copying  in  sliding  window  computation.  Each  cell  denotes  a  token.  Colored  tokens  are  peeked  only  and  never 
consumed,  (a)  Data  token  distribution  in  lined-grained  interleaving  execution  when  B  is  replicated  by  2;  (b)  Data  token  distribution  in  coarse¬ 
grained  interleaving  execution  when  B  is  replicated  by  2;  (c)  Data  token  distribution  in  coarse-grained  interleaving  when  execution  B  is 
replicated  by  3;  (d)  Copy  peeking  tokens  when  switching  configurations. 


one  iteration.  This  can  be  achieved  by  scaling  up  the  same  num¬ 
ber  of  times  each  filter  executes  in  one  iteration.  This  constraint 
implies  that,  in  pre-prologue  states,  only  the  input  channels  of  the 
first  replicated  peeking  filters  contain  tokens  regardless  of  config¬ 
urations.  As  a  result,  after  applying  reverse  sequences  to  bring  a 
configuration  to  a  pre-prologue  state,  we  only  need  to  copy  tokens 
from  the  input  channel  of  the  first  replicated  filters  in  the  current 
configuration  to  the  input  channel  of  the  respective  first  replicated 
filters  in  the  incoming  configuration.  For  example,  only  the  tokens 
in  the  input  channel  of  Bi  need  to  be  copied  to  the  input  channel  of 
Bj  as  in  Figure  5(d).  The  tokens  on  the  channel  between  A  and  C, 
due  to  the  executions  of  A  in  the  initialization  stage,  only  need  to 
be  copied  if  the  two  configurations  contain  different  versions  of  A 
and  C  or  the  channel  between  A  and  C  needs  to  be  reallocated  for 
a  larger  buffer  size.  As  the  number  of  peeking  filters  and  the  num¬ 
ber  of  peeking  tokens  are  small  [33],  it  would  take  an  insignificant 
amount  of  time  to  copy. 


4.  Task  Decomposition  for  Low-Power 

As  discussed  in  Section  2.1.2,  when  more  cores  become  avail¬ 
able  because  some  other  programs  terminated,  if  processors  possess 
DVFS  capability,  we  can  reduce  energy  consumption  by  decompos¬ 
ing  tasks,  thereby  spreading  workload  on  more  cores,  and  lowering 
operating  voltages  and  frequencies.  We  will  derive  an  analysis  that 
helps  justify  the  hypothesis  supported  by  the  experimental  results 
in  Section  5.2. 

We  employ  the  energy  model  for  one  CPU  from  [29]: 


Pcpu  —  Pdynamic  T  Pstatic  —  ( CeVcpufcpu )  +  (Ol  Vcpu  +  Oln)  (9) 


Based  on  equation  (9),  the  dynamic  energy  consumption  P„  if 
computation  is  spread  on  n  cores  is: 


Pn 


VZfn 


(10) 


If  we  assume  that  inter-core  communication  consumes  very  lit¬ 
tle  energy  in  comparison  with  computational  energy  or  the  same  as 
intra-core  communication,  spreading  stream  graphs  on  more  cores 
enables  lowering  operating  frequencies  and  voltages,  although  it  in¬ 
creases  Ci-  Suppose  that  we  can  run  a  stream  application  at 

a  specific  IO  rate  with  two  configurations  of  n  and  m  cores,  where 
n  >  m.  From  equation  (10),  we  arrive  at: 


Pn  _  n  /K \2  /u 

Pm  771  y  Vm  )  fm 


(11) 


Note  that  the  n-core  configuration  is  supposed  to  be  as  fast  as  the 
m- core  configuration.  As  a  result,  it  requires  that  the  number  of 
instructions  delivered  by  n  cores  in  one  second  be  equal  to  that 
of  m  cores:  nfnIPCavg  «  rnfmIPCa.vg,  where  IPCavS  is  the 


average  number  of  instructions  per  cycle  and  it  should  be  the  same 
for  the  two  configurations  because  both  configurations  run  the  same 
workload  targeting  the  same  IO  rates.  Equivalently,  ^  «  p . 
Plugging  into  (11),  we  arrive  at: 


(12) 


From  [18],  frequencies  relate  to  voltages  as  follows: 


/  °< 


(VcPu  -  Vt)1 

Vcpu 


(13) 


where  Vt  is  the  threshold  voltage  of  transistors  and  7  depends 
on  the  carrier  velocity  saturation  and  lies  between  1.2  to  1.6  for 
the  current  technologies.  As  a  result,  running  on  more  cores  with 
lower  frequencies  can  reduce  energy  consumption.  For  example, 
771  <C  71  =£*  fn  <  fm  Vn  <C  Vm  — ^  Pn  A  Pm • 

Note  that  this  is  a  simplified  energy  model  for  stream  applica¬ 
tions  on  multicore  machines.  This  model  assumes  that  computation- 
communication  ratios  are  large  enough  to  approximately  ignore 
inter-core  communication  energy  or  inter-core  communication  con¬ 
sumes  the  same  amounts  of  energy  as  intra-core  communication. 
The  first  assumption  depends  on  benchmarks’  characteristics  while 
the  second  assumption  is  often  not  true  in  practice.  This  approxima¬ 
tion  analysis  serves  as  a  predictive  model  to  explained  the  results 
in  Section  5.2.  In  addition,  using  more  cores  would  increase  over¬ 
all  static  power  and  reduce  operating  voltages.  Because  reducing 
operating  voltages  lowers  static  energy  consumption  as  in  equation 
(9),  as  a  consequence,  when  taking  both  static  and  dynamic  power 
into  account,  we  need  to  make  sure  that  the  static  power  increment 
due  to  additional  used  cores  is  lower  than  the  static  and  dynamic 
power  saved  by  lowering  voltage  and  frequency.  A  detailed  proces¬ 
sor  power  model  can  help  determine  optimal  configurations. 


5.  Evaluations 

We  implement  the  StreaMorph  scheme  in  the  Streamlt  com¬ 
piler  [14].  To  model  input  data  token  streams  controlled  externally 
at  certain  rates,  we  instrument  code  of  source  filters  with  the  token- 
bucket  mechanism  [32],  Whenever  a  source  filter  wants  to  send  out 
a  number  of  data  tokens,  it  has  to  acquire  the  same  number  of  rate 
control  tokens  in  a  bucket.  If  the  bucket  does  not  have  enough  rate 
control  tokens,  the  thread  of  the  source  filter  will  sleep  and  wait 
until  there  are  enough  rate  tokens  in  the  bucket.  A  timer  interrupt 
periodically  fills  the  bucket  with  rate  control  tokens  at  rate  r  and 
wakes  up  the  waiting  thread.  Our  coarse-grain  level  implementa¬ 
tion  of  this  mechanism  has  a  negligible  effect  on  performance. 

We  run  our  experiments  on  a  server  with  two  Intel  Xeon  E5450 
quad-core  CPUs  operating  at  two  frequencies  2GHz  and  3GHz.  We 
use  a  power  meter  to  measure  the  power  consumption  of  the  whole 


system.  The  system  has  a  typical  idle  power  consumption  Pidie  ~ 
228  watts.  For  each  benchmark,  we  measure  the  dynamic  power 
consumption.  Pioad ,  computed  using  the  following  equation: 

Pload  —  Pineasure  Pidie  (14) 

We  use  the  same  set  of  benchmarks  in  other  papers  [15.  22], 
Most  of  the  benchmarks  are  in  the  DSP  domain.  Each  benchmark 
is  compiled  into  a  program  composed  of  several  configurations, 
where  each  configuration  is  specific  to  a  number  of  cores. 

5.1  Energy  Reduction  by  Task  Consolidation 

In  this  section,  we  evaluate  the  effectiveness  of  the  task  consolida¬ 
tion  scheme  using  the  StreaMorph  technique.  Suppose  that  when 
the  IO  rates  of  a  stream  program  are  reduced,  the  processors  be¬ 
come  under-utilized.  As  a  result,  we  can  morph  the  stream  graph 
of  the  application  to  minimize  the  number  of  cores  used  to  reduce 
Pioad',  cores  are  run  at  3GHz.  Figure  6(a)  and  Figure  6(b)  show 
the  effectiveness  of  consolidating  tasks  using  StreaMorph  to  min¬ 
imize  the  number  of  cores  used  to  one  and  four  cores  respectively 
when  input  rates  become  low  enough.  Concretely,  morphing  from 
eight  cores  to  four  cores  reduces  energy  consumption  by  29.90% 
on  average.  Morphing  from  eight  cores  to  one  core  reduces  energy 
consumption  by  76.33%  on  average. 

5.2  Energy  Reduction  by  Task  Decomposition 

This  section  demonstrates  experimentally  the  effectiveness  of  the 
task  decomposition  scheme  with  StreaMorph.  When  there  are  more 
available  cores  in  the  system,  because  some  other  program  termi¬ 
nates,  the  analysis  in  Section  4  suggests  we  should  transfer  a  part 
of  the  workload  to  the  newly  available  cores  to  lower  operating 
voltages  and  frequencies  of  all  the  cores  to  save  energy.  For  each 
benchmark,  we  use  the  workload  rate  that  can  be  handled  by  4  cores 
at  3GHz.  We  use  the  StreaMorph  technique  to  switch  each  appli¬ 
cation  to  a  new  configuration  using  more  cores,  such  that  the  new 
configuration  can  still  handle  the  same  workload  rate  at  a  lowered 
frequency,  say  2GHz.  Figure  7(a)  shows  the  measured  energy  con¬ 
sumptions  and  Figure  7(b)  shows  the  energy  reduction  percentage 
gained  by  task  decomposition.  On  average,  the  energy  reduction 
percentage  is  around  10%. 

Our  experiment  also  shows  that  this  method  is  only  effective 
for  the  benchmarks  that  have  high  computation-communication  ra¬ 
tios  [14];  in  this  experiment,  the  ratio  is  greater  than  100.  The  rea¬ 
son  for  this  result  is  that  the  benchmarks  with  small  computation- 
communication  ratios  would  incur  significant  inter-core  communi¬ 
cation  energy  overhead  compared  to  computational  energy. 

5.3  StreaMorph  vs.  Flextream  Task  Migration 

In  the  previous  sections,  we  demonstrated  how  StreaMorph  can 
help  save  energy.  However,  how  is  StreaMorph  compared  to  the 
straightforward  filter  migration  scheme  implemented  in  Flex¬ 
tream  [17]  by  Hormati  et  al.?  Figure  8  shows  the  advantage  of 
StreaMorph  over  a  Flextream  filter  migration  scheme  in  reduc¬ 
ing  buffer  sizes  when  switching  from  multiple  cores  to  one  core. 
For  example,  when  switching  from  eight  cores  to  one  core,  be¬ 
cause  StreaMorph  transforms  the  stream  graph  structures  of  ap¬ 
plications,  it  reduces  the  buffer  sizes  by  82.58%  on  average  over 
Flextream,  which  does  not  modify  the  stream  graph  structures. 
Even  when  switching  from  two  cores  to  one  core,  StreaMorph  can 
help  reduce  buffer  sizes  by  57.62%.  Especially,  for  benchmarks 
ChannelVocoder,  FMRadio,  FilterBank,  StreaMorph  can  help 
reduce  buffer  sizes  more  substantially,  while  it  is  not  useful  in 
the  case  of  TDE  because  TDE  does  not  require  the  filter-replication 
technique  even  for  eight  cores. 

In  addition.  Flextream  does  not  allow  the  optimizations  depen¬ 
dent  on  the  number  of  cores  as  described  in  Section  3.1.3.  Further¬ 


more,  in  Flextream,  filters  are  not  fused  to  allow  fine-grain  filter 
migration  to  avoid  the  situation  in  Figure  2,  while  filter  fusion  is 
beneficial  to  reduce  synchronization  and  communication  between 
filters  [15].  As  a  consequence,  Flextream  suffers  around  9%  per¬ 
formance  penalty  from  the  optimal  configurations  [17],  Our  exper¬ 
iment  comparing  the  performances  of  StreaMorph  and  the  Flex¬ 
tream  filter  migration  technique  for  switching  from  five  cores  to 
three  cores  also  result  in  the  same  result.  To  save  space,  we  do  not 
present  the  result  here. 

5.4  Switching  Time 

We  have  shown  that  switching  between  configurations  can  help 
reduce  energy  consumption  of  multicore  systems.  However,  if  it 
takes  too  long  to  switch  between  configurations,  QoS  of  stream 
programs  may  suffer.  It  is  desirable  to  measure  the  switching  time 
of  the  system.  A  configuration  switch  is  composed  of  three  steps: 
the  epilogue  stage  of  the  current  configuration,  state  copying,  and 
the  prologue  stage  of  the  incoming  configuration.  The  following 
equation  shows  how  switching  time  is  broken  down: 

tswitching  —  tepilogue  T"  Estate  — copying  T"  tprologue  (15) 


Prologue  time  (/is) 

Epilogue  time  (/is)  1 

Copy  size  (bytes)  1 

Benchmarks 

Max 

Average 

Max 

Average 

Token 

State 

BitonicSort 

716 

126 

27 

22 

0 

4 

ChannelVocoder 

2999 

2477 

45703 

39271 

8820 

252 

DCT 

28 

21 

104 

68 

0 

4 

DES 

210 

153 

148 

115 

0 

4 

eft 

52 

38 

37 

31 

0 

4 

FilterBank 

1442 

1121 

12385 

7217 

1984 

508 

FMRadio 

1582 

988 

56575 

26974 

492 

508 

MPEG2Decoder 

116 

75 

176 

125 

0 

4 

Radar 

123 

104 

347 

305 

0 

1032 

Serpent 

2303 

853 

648 

548 

0 

0 

TDE 

5 

3 

770 

361 

0 

4 

Table  1.  Switching  time  statistics. 

We  run  each  benchmark  and  measure  tepilogue  and  tprologue 
on  1  to  8  cores.  To  save  space,  we  only  report  the  maximum 
and  average  measured  values  for  each  benchmark  in  Table  1.  As, 
t state- copying  ~  0,  it  is  often  too  small  to  measure  exactly,  we 
instead  report  the  number  of  bytes  to  copy  for  each  benchmark  in 
Table  1 .  We  can  use  the  data  from  the  table  to  compute  maximum 
amounts  of  time  to  switch  from  one  configuration  to  another  con¬ 
figuration.  For  example,  for  the  DES  to  switch  from  3  cores  to  7 
cores,  tepilogue  for  3  cores  is  bounded  by  148/rs  and  tproiogue  for 
7  cores  is  bounded  by  210/is,  and  t state- copying  =  0  because  there 
is  no  token  that  needs  to  copy.  The  total  switching  time  is  smaller 
than  or  equal  to  148  +  0  +  210  =  358/rs  for  DES.  We  can  see  that 
switching  times  are  small  enough  not  to  degrade  user  experience. 

In  addition,  note  that  during  the  configuration  switching  pro¬ 
cess,  processors  still  do  useful  work  such  as  running  epilogue  and 
prologue  stages,  and  as  a  result,  the  overall  performance  would  not 
be  affected  too  much. 

6.  Lessons  Learned 

Designing  the  StreaMorph  scheme,  we  realized  a  number  of  prin¬ 
ciples  for  designing  adaptive  programs: 

•  Modularity:  Applications  should  be  decomposed  into  subpro¬ 
cesses  to  allow  task  migration  between  cores. 

•  Functional  subprocesses:  Ideally,  the  subprocesses  should  be 
designed  to  be  stateless  to  expose  parallelism,  e.g.  by  replicat¬ 
ing  stateless  processes. 

•  Internal  state  exposure:  States  of  programs  should  be  exposed 
by  storing  on  external  queues  instead  of  within  subprocesses  to 
facilitate  state  migration  while  morphing  programs. 


Power  consumption  at  1-core  workload  rates 


Power  consumption  at  4-core  workload  rates 


(a)  Energy  consumption  when  input  rates  saturate  one  core. 


(b)  Energy  consumption  when  input  rates  saturate  four  cores. 


Figure  6.  Energy  reduction  by  task  consolidation. 
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Buffer  size  reduction:  StreaMorph  vs.  Flextream  filter  migration 
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Figure  8.  Buffer  reduction. 


Figure  7.  Energy  reduction  by  task  decomposition. 


•  Predictable  inter-process  communication  rates:  This  property 
eases  program  state  transferring  processes,  e.g.  by  reducing  the 
amount  of  copied  state. 

These  principles  point  to  other  domains  for  adaptive  programs.  We 
find  we  can  apply  the  adaptive  program  concept  to  programming 
models  such  as  the  Cilk  [10],  PetaB ricks  [4]  and  SEDA  [34], 

7.  Related  Work 

The  Hormati  et  al.’s  Flextream  [17]  work  is  closely  related  to  our 
work.  Hormati  et  al.  present  a  method  that  can  efficiently  reparti¬ 
tion  stream  programs  to  adapt  dynamically  to  environment  changes 
such  as  the  number  of  available  cores.  The  major  drawback  of  the 
Flextream  is  that  it  does  not  reduce  memory  usage  proportionally 
when  reducing  the  number  of  cores  used  as  shown  in  Section  5.3 
as  well  as  9%  performance  degradation.  In  addition,  we  not  only 
present  a  method  enabling  the  above  adaptations  but  also  show  how 
to  use  the  method  to  reduced  energy  consumption  of  stream  pro¬ 
grams  by  putting  stream  programs  into  external  settings.  Aleen  et 
al.  [2]  propose  a  method  to  dynamically  predict  running  times  of 
portions  of  streaming  programs  based  on  input  values.  Portions  of 
programs  are  dynamically  relocated  across  cores  based  on  predic¬ 
tion  results  to  balance  workload  on  multicore.  Task  consolidation 
has  been  deployed  in  the  Linux  kernel  to  reduce  energy  consump¬ 
tion  by  relocating  workloads  to  fewer  cores  to  allows  other  cores  to 
turn  into  deep  sleep  states  [31].  Besides  the  task  consolidation  tech¬ 
nique,  adjusting  processors’  speed  using  the  DVFS  capability  to 
computation  demands  is  another  popular  technique.  Choi  et  al.  [9] 
present  a  method  to  reduce  energy  consumption  of  a  MPEG  de¬ 
coder  program  by  adapting  processor  frequencies  to  MPEG  video 
frame  rates.  The  dynamic  knobs  framework  [16]  dynamically  ad¬ 
justs  processor  frequencies  based  on  QoS  requirements  to  save  en¬ 
ergy.  In  [5],  Baek  and  Chilimbi  present  a  framework  that  compile 


programs  into  adaptive  ones  that  can  return  approximate  results 
based  on  QoS  requirements  to  reduce  energy  consumption. 

Hardware  energy-efficient  research  has  been  focusing  design¬ 
ing  processors  that  can  adapt  themselves  to  QoS  requirements.  Ex¬ 
ecuting  code  regions  of  certain  characteristics  to  suitable  cores  in 
heterogeneous  multicore  systems  composed  of  cores  with  different 
points  of  energy/performance  sharing  the  same  ISA  to  save  energy 
has  been  explored  in  [23,  26,  36].  Another  approach  to  energy- 
efficient  computing  is  to  exploit  the  DVFS  capability  of  modern 
processors  [3,  35],  Burger  et  al.  proposed  an  Explicit  Data  Graph 
Execution  (EDGE)  architecture  [8]  to  reduce  energy  consumption 
by  getting  rid  of  complicated  speculation  circuits  inside  modern 
processors  and  using  compiler  techniques  instead. 

Our  work  is  also  related  to  the  fair  multiple  resource  sharing 
problem  [11,  12]  in  cloud  computing.  Our  StreaMorph  scheme, 
when  reducing  the  number  of  cores  used,  also  reduces  memory  us¬ 
ages  accordingly.  This  feature  makes  our  StreaMorph  scheme  more 
suitable  for  cloud  computing  than  the  Flextream  approach  [17]. 

In  [28],  Parhi  and  Messerschmitt  present  a  method  for  find¬ 
ing  multiprocessor  rate-optimal  schedules  for  data  flow  programs. 
Rate-optimal  schedules  help  programs  achieve  minimal  periods  for 
iterations  given  an  infinite  number  of  cores.  Renfors  and  Neuvo’s 
framework  [30]  focuses  on  determining  the  minimal  sampling  pe¬ 
riods  for  a  given  digital  filter  structure  assuming  the  speed  of  arith¬ 
metic  operations  is  known  and  the  number  of  processing  units  is 
infinite.  This  line  of  work  is  different  from  our  work  in  the  sense 
that,  they  assume  static  streaming  rates  and  applications  are  opti¬ 
mized  for  specific  hardware  platforms  while  in  cloud  computing, 
mobile  computing  and  cyber-physical  systems,  applications  run  on 
a  wide  variety  of  underlying  hardware. 

Finally,  our  work  derives  from  the  work  by  the  Streamlt  com¬ 
piler  group  [14,  15,  33].  However,  we  focus  on  the  energy-efficient 
aspect  instead  of  speed  optimization  [15,  22].  Our  analysis  de- 


pends  on  the  static  properties  of  the  SDF  to  derive  sequences  of 
filter  executions  that  drain  tokens  on  channels.  Deriving  such  re¬ 
verse  sequences  for  more  expressive  stream  models  of  computa¬ 
tions  such  as  Kahn  process  networks  [19]  is  problematic  due  to 
unknown  traffic  patterns  between  processes.  Although  this  work 
is  within  the  SDF  domain,  the  process  migration  technique  to  im¬ 
prove  core  utilization  and  reduce  energy  consumption  is  applicable 
to  other  streaming  languages  as  well  [7.  20,  27]. 

8.  Conclusion 

We  have  made  a  case  for  exploiting  high-level  abstractions  to  de¬ 
sign  adaptive  programs  by  presenting  our  StreaMorph  technique 
for  stream  programs.  We  have  shown  that  high-level  abstractions 
can  help  design  adaptive  programs.  The  concept  of  adaptive  pro¬ 
grams  proposed  in  this  paper  is  important  in  cloud  computing,  mo¬ 
bile  computing,  and  cyber-physical  systems  when  applications  can 
be  dynamically  deployed  and  migrated  on  a  wide  variety  of  envi¬ 
ronments.  Morphing  programs  can  also  help  isolate  performance  of 
applications,  thereby  improving  QoS  of  applications. 

This  work  can  be  extended  in  several  directions.  Our  next  step 
would  be  predicting  the  number  of  cores  necessary  for  a  given  10 
rate.  We  plan  to  apply  the  adaptive  program  concept  to  applications 
with  implicit  parallelism  with  multiple  algorithm  choices,  e.g.  in 
PetarB ricks  [4],  so  that  applications  can  execute  adaptively  under 
fair-multiple-resource  constraints  [11,  12]. 

Our  evaluations  use  Intel  Xeon  processors,  it  would  be  more 
interesting  if  our  evaluations  are  done  using  multimedia  processors, 
however,  at  the  time  the  paper  is  written,  we  do  not  have  a  multicore 
multimedia  processor  platform  at  hand.  We  also  have  not  explored 
the  idea  of  using  configuration  switching  to  lower  core  utilization 
to  protect  processors  from  overheating. 
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